Add new AtU8 beam chunk #1078

josevalim · 2016-05-31T14:04:25Z

The new chunk stores atoms encoded in UTF-8.

This is still a work in progress; the following tasks are pending:

Add the compile option r19 that will compile atoms to the old "Atom" chunk, as mentioned by @bjorng on the mailing list
Support 255 UTF-8 code points in binary_to_atom instead of 255 bytes. The issue is that counting the characters is linear in the binary size. Although erts_atom_put performs that calculation, it returns a NON-VALUE for both invalid-encoding and oversized-binary errors, whereas the test suite expects badarg for invalid encoding and system_limit for large binaries. To solve this, we can either:
1. add a new function, used by both erts_atom_put and binary_to_atom, that properly encodes the two kinds of errors so we can act accordingly
2. change binary_to_atom to only raise badarg and no longer raise on system_limit
3. compute the size in the binary_to_atom BIF and continue raising a system limit error (effectively traversing the binary twice)
Thoughts? Any other options? Once this is settled, the docs will be updated accordingly.
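A minimal sketch of what option 1 could look like, assuming a single pass that both validates the UTF-8 and counts code points, and reports the two failure kinds separately. The names `sketch_utf8_count`, `SKETCH_BAD_ENCODING` and `SKETCH_TOO_LONG` are hypothetical and are not part of ERTS. The validation is deliberately simplified: it checks lead-byte ranges and continuation bytes only, not every overlong or surrogate form.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes; the names are illustrative only. */
enum { SKETCH_BAD_ENCODING = -1, SKETCH_TOO_LONG = -2 };

/*
 * Count the code points in a UTF-8 buffer in a single pass.
 * Returns the count, SKETCH_BAD_ENCODING for malformed input, or
 * SKETCH_TOO_LONG as soon as the count exceeds max_chars.
 */
int sketch_utf8_count(const unsigned char *s, size_t len, int max_chars)
{
    size_t i = 0;
    int chars = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char b = s[i];
        size_t need;                    /* continuation bytes expected */

        if (b < 0x80) need = 0;
        else if (b >= 0xC2 && b <= 0xDF) need = 1;
        else if (b >= 0xE0 && b <= 0xEF) need = 2;
        else if (b >= 0xF0 && b <= 0xF4) need = 3;
        else return SKETCH_BAD_ENCODING;

        if (len - i - 1 < need) return SKETCH_BAD_ENCODING;
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return SKETCH_BAD_ENCODING;
        }
        i += need + 1;
        if (++chars > max_chars) return SKETCH_TOO_LONG;
    }
    return chars;
}
```

With distinct return values, a caller can raise badarg for one and system_limit for the other without traversing the binary twice.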

Feedback request:

beam_lib has been modified to support a new 'utf8_atoms' chunk that maps to the new "AtU8" chunk. This means the old "Atom" chunk may now be missing when accessed, although it is straightforward to handle that with a case statement.
list_to_atom has not been modified and it will still fail if given a character greater than 255. I decided not to change list_to_atom because its current documentation does not say it may support characters greater than 255 in the future (whereas the binary_to_atom documentation has a very explicit warning about this).

ferd · 2016-05-31T14:43:05Z

What is meant by a UTF-8 code point? I thought that by the time you're in UTF-8 you no longer deal in code points, just in bytes?

The Unicode consortium does specify that lengths can be counted in one of 4 ways. From the FAQ using the string aनि亜𐂃 (U+0061, U+0928, U+093F, U+4E9C, U+10083)

bytes: UTF-8: 14, UTF-16: 12, UTF-32: 20
code units (position in a string with the given size): UTF-8: 14, UTF-16: 6, UTF-32: 5
code points (how many unicode code points are in there): 5 for all, since there is a combining mark
grapheme clusters: 4 for all of them, since this is a logical grouping for the length based on 'characters' as we people usually think of them, rather than the implementation of them.

The latter two are not encoding-specific. What exactly is meant by UTF-8 code points? Just code points, or is it a misnomer for code units (and hence bytes)?
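To make the first three measures concrete, here is a small sketch that derives them from a UTF-8 buffer. It assumes the input is already well-formed UTF-8 and does not attempt grapheme clusters, which need Unicode segmentation data. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Lengths of one string under the measures discussed above. */
struct sketch_lengths {
    size_t utf8_bytes;     /* also the number of UTF-8 code units */
    size_t utf16_units;    /* 16-bit code units */
    size_t code_points;    /* also the number of UTF-32 code units */
};

/* Assumes s is well-formed UTF-8; no validation is done here. */
struct sketch_lengths sketch_measure(const unsigned char *s, size_t len)
{
    struct sketch_lengths r = { len, 0, 0 };

    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xC0) == 0x80)
            continue;                      /* continuation byte */
        r.code_points++;
        /* Code points above U+FFFF (4-byte lead) need a surrogate pair. */
        r.utf16_units += (s[i] >= 0xF0) ? 2 : 1;
    }
    return r;
}
```

For the FAQ string (U+0061, U+0928, U+093F, U+4E9C, U+10083) this gives 14 UTF-8 bytes, 6 UTF-16 code units and 5 code points, matching the figures listed above.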

josevalim · 2016-05-31T14:53:40Z

Hey @ferd, I meant literally code points as in your definition:

> code points (how many unicode code points are in there): 5 for all, since there is a combining mark

It is also worth mentioning the implementation today (prior to this patch) also checks code points. This can be checked in the source code or by trying a quick example:

3> binary_to_atom(binary:copy(<<"é"/utf8>>, 255), utf8).
ééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééé
4> byte_size(binary:copy(<<"é"/utf8>>, 255)).
510
5> binary_to_atom(binary:copy(<<"é"/utf8>>, 256), utf8).
** exception error: a system limit has been reached
     in function  binary_to_atom/2
        called as binary_to_atom(<<"éééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééééé"/utf8...>>,
                                 utf8)

So this pull request is currently backwards incompatible (hence the pending tasks). The question is mostly how to implement the new binary_to_atom efficiently.

margnus1 · 2016-05-31T20:42:47Z

erts/emulator/beam/beam_load.c

@@ -6332,7 +6355,7 @@ erts_make_stub_module(Process* p, Eterm Mod, Eterm Beam, Eterm Info)
 	goto error;
    }
    define_file(stp, "atom table", ATOM_CHUNK);
-    if (!load_atom_table(stp)) {
+    if (!load_atom_table(stp, ERTS_ATOM_ENC_LATIN1)) {


You'll need to go through the same if (stp->chunks[UTF8_ATOM_CHUNK].size > 0) ... charade down here, or code:make_stub_module/3 (and thus HiPE) will break. There are some tests for it in erts/emulator/test/code_SUITE*
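A rough sketch of the selection being asked for, assuming the loader prefers the "AtU8" chunk whenever it is present and otherwise falls back to the Latin-1 "Atom" chunk. The types and names below are hypothetical stand-ins, not the real ERTS loader state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the loader state; not the real ERTS types. */
enum sketch_enc { SKETCH_ENC_LATIN1, SKETCH_ENC_UTF8 };
enum { SKETCH_ATOM_CHUNK, SKETCH_UTF8_ATOM_CHUNK, SKETCH_NUM_CHUNKS };

struct sketch_chunk { const unsigned char *start; size_t size; };
struct sketch_loader { struct sketch_chunk chunks[SKETCH_NUM_CHUNKS]; };

/*
 * Pick the atom chunk to load: prefer "AtU8" when it is present,
 * otherwise fall back to the old Latin-1 "Atom" chunk.
 * Returns the chosen chunk index and stores its encoding in *enc.
 */
int sketch_pick_atom_chunk(const struct sketch_loader *stp, enum sketch_enc *enc)
{
    assert(stp != NULL && enc != NULL);
    if (stp->chunks[SKETCH_UTF8_ATOM_CHUNK].size > 0) {
        *enc = SKETCH_ENC_UTF8;
        return SKETCH_UTF8_ATOM_CHUNK;
    }
    *enc = SKETCH_ENC_LATIN1;
    return SKETCH_ATOM_CHUNK;
}
```

Putting the choice in one helper would let both the normal load path and the stub-module path share it.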

josevalim · 2016-06-01

Thank you. I have run the code suite, fixed this and the remaining failing tests, and updated this PR.

OTP-Maintainer · 2016-06-01T18:42:10Z

Patch has passed first testings and has been assigned to be reviewed

I am a script, I am not human

bjorng · 2016-06-03T04:58:20Z

We will not have time to do a thorough review until OTP 19.0 has been released, but here is some quick feedback:

The test case lc_SUITE:effect/1 fails.

binary_to_atom(): Solution 1, that is, keep the existing error semantics and implement it efficiently.

Not modifying list_to_atom/1 is an interesting idea that we have not considered.

nox · 2016-06-03T13:12:28Z

@bjorng Can we name the chunk 'JOSÉ'? PLS.

bjorng · 2016-09-22T14:09:02Z

We will take a closer look at this pull request when you have fixed the failed test case mentioned above and rebased it on the latest master.

nox · 2016-09-22T14:13:25Z

> It is also worth mentioning the implementation today (prior to this patch) also checks code points. This can be checked in the source code or by trying a quick example:

I think you interpreted that incorrectly. Currently only UTF-8 encoded atoms that can be converted to Latin-1 are supported, so it just counts the number of Latin-1 code points, which equals the number of Latin-1 code units, which equals the number of bytes in the result.
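The point can be illustrated with a small sketch, assuming a check that accepts only UTF-8 input whose code points are all at most U+00FF. Every accepted code point becomes exactly one Latin-1 byte, so the code point count and the resulting byte count coincide. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns the number of code points if every code point in the UTF-8
 * buffer is at most U+00FF, or -1 otherwise (including malformed input).
 */
int sketch_latin1_length(const unsigned char *s, size_t len)
{
    size_t i = 0;
    int chars = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        if (s[i] < 0x80) {
            i += 1;                                   /* U+0000..U+007F */
        } else if ((s[i] == 0xC2 || s[i] == 0xC3) &&
                   i + 1 < len && (s[i + 1] & 0xC0) == 0x80) {
            i += 2;                                   /* U+0080..U+00FF */
        } else {
            return -1;                                /* not Latin-1 */
        }
        chars++;
    }
    return chars;
}
```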

josevalim · 2016-09-22T14:45:24Z

@bjorng I will allocate time in the next weeks to work on all of my pending pull requests. :D

While talking to @nox, we came up with another solution for the second bullet above. @nox commented that it would be much simpler to count bytes instead of counting codepoints. However, if we count bytes and keep the limit of 255 bytes, it would be backwards incompatible because an atom made only of the letter é in latin1 will now take double the space in UTF-8, possibly triggering the system limit.

However, if we double the maximum atom size, we could keep the conceptually simpler and faster code that only counts bytes and remain backwards compatible.
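The arithmetic behind the doubling can be sketched as follows: each Latin-1 byte below 0x80 stays one byte in UTF-8, and each byte from 0x80 upward becomes two, so a 255-character Latin-1 atom needs at most 510 UTF-8 bytes. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* UTF-8 size in bytes of a Latin-1 string: bytes >= 0x80 take two bytes. */
size_t sketch_latin1_utf8_size(const unsigned char *s, size_t len)
{
    size_t out = 0;

    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++)
        out += (s[i] < 0x80) ? 1 : 2;
    return out;
}
```

For 255 copies of é (0xE9 in Latin-1) this yields 510, the same byte size the shell session earlier in the thread reports.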

From the Elixir perspective, increasing the atom limit would be a positive change: since function names must be atoms, we may sometimes reach this limit when building functions dynamically.

@bjorng thoughts on this suggestion?

bjorng · 2016-09-24T05:11:52Z

I think that the BIFs that create atoms should count code points, and let the limit be 255 code points. When converting from a list, the number of code points is the number of elements in the list (assuming that we don't try to combine code points that can be combined).

The chunk in the BEAM file can give the size of each atom in bytes, and the loader can trust the compiler not to create overlong atoms. (There could be a sanity check to abort the loading if any atom is longer than 1024 bytes.)

bjorng · 2016-09-24T05:18:31Z

Rationale for counting code points: We don't want to handle support requests from Klingon users who complain that they can only put 85 Klingon characters into their atoms, while speakers of most west-European languages can have 255 characters in their atoms.
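The figure of 85 is consistent with a 255-byte limit and three UTF-8 bytes per character. A short sketch of that arithmetic follows. The use of U+F8D0 as a Klingon code point is an assumption (a private-use assignment from the ConScript registry, not part of the Unicode standard), and the function names are hypothetical.

```c
#include <assert.h>

/* Bytes needed to encode code point cp in UTF-8 (cp <= 0x10FFFF). */
unsigned sketch_utf8_width(unsigned long cp)
{
    assert(cp <= 0x10FFFFUL);
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000UL) return 3;
    return 4;
}

/* How many copies of cp fit under a byte limit. */
unsigned sketch_chars_that_fit(unsigned limit_bytes, unsigned long cp)
{
    return limit_bytes / sketch_utf8_width(cp);
}
```

Under a byte-based limit of 255, ASCII letters fit 255 times, é fits 127 times, and a three-byte character fits 85 times, which is the disparity a code point limit avoids.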

nox · 2016-09-24T12:13:08Z

What about following what ROK proposed in EEP20?

2 bytes for byte length, 2 bytes for code point count.
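A sketch of such a four-byte per-atom header, going only by the summary above (two bytes of byte length, two bytes of code point count). The big-endian byte order and all names are assumptions for illustration, not something taken from EEP 20 or from the BEAM file format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-atom header: byte length, then code point count. */
struct sketch_atom_header { uint16_t byte_len; uint16_t cp_count; };

/* Serialise the header as two big-endian 16-bit fields (assumed order). */
void sketch_header_encode(struct sketch_atom_header h, unsigned char out[4])
{
    assert(h.cp_count <= h.byte_len);  /* each code point takes >= 1 byte */
    out[0] = (unsigned char)(h.byte_len >> 8);
    out[1] = (unsigned char)(h.byte_len & 0xFF);
    out[2] = (unsigned char)(h.cp_count >> 8);
    out[3] = (unsigned char)(h.cp_count & 0xFF);
}

/* Read the two fields back from four bytes. */
struct sketch_atom_header sketch_header_decode(const unsigned char in[4])
{
    struct sketch_atom_header h;
    h.byte_len = (uint16_t)((in[0] << 8) | in[1]);
    h.cp_count = (uint16_t)((in[2] << 8) | in[3]);
    return h;
}
```

Storing both numbers would let a reader check the 255 code point limit without decoding the atom text.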

josevalim · 2016-09-26T11:26:11Z

erts/emulator/beam/atom.c

-Eterm
-erts_atom_put(const byte *name, int len, ErtsAtomEncoding enc, int trunc)
+int
+erts_atom_put_index(const byte *name, int len, ErtsAtomEncoding enc, int trunc)


@bjorng I have introduced this function, which returns the atom index instead of the atom, so that callers can decide whether to raise badarg or system_limit. I have two questions I would love your input on:

1. Should the name have the erts_ prefix?

2. Should we use #define ATOM_BAD_ENCODING and #define ATOM_MAX_CHARS instead of bare integers? If so, is it fine to define them in atom.h?

Notice the new function and its return values are used in binary_to_atom (erl_unicode.c). The PR has also been rebased. Once those questions are answered, I will squash everything and we should be good to go.

psyeugenic · 2016-09-26

The convention for the erts_ prefix is to use it for global functions. Older functions, and some very frequently used ones such as size_object, do not use it, though.

bjorng · 2016-09-26

Answer for 2: Yes, and yes.
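A minimal sketch of how named error codes could be consumed by a caller, assuming the index-returning function yields a non-negative index on success and a negative code on failure. The macro names come from the question above. Their values, the `sketch_outcome` enum and `sketch_classify` are assumptions for illustration.

```c
#include <assert.h>

/* Names taken from the question above; the values are assumptions. */
#define ATOM_BAD_ENCODING (-1)
#define ATOM_MAX_CHARS    (-2)

/* Hypothetical outcome a BIF would translate into an exception. */
enum sketch_outcome { SKETCH_OK, SKETCH_BADARG, SKETCH_SYSTEM_LIMIT };

/* Map an index-or-error result to the exception the BIF should raise. */
enum sketch_outcome sketch_classify(int atom_index)
{
    if (atom_index == ATOM_BAD_ENCODING) return SKETCH_BADARG;
    if (atom_index == ATOM_MAX_CHARS) return SKETCH_SYSTEM_LIMIT;
    assert(atom_index >= 0);
    return SKETCH_OK;
}
```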

josevalim · 2016-09-26T14:50:11Z

@bjorng This is ready for review. I have rebased, addressed all feedback and completed the pending features.

I have run the emulator, compiler and stdlib test suites with the following results:

emulator: One failure regarding registered sends across nodes (I inspected the code and it seems not to be related to this patch)
compiler: No failures
stdlib: One failure in the shell suite which seems unrelated

I have also added a test that compiles a module from the Erlang Abstract Format with utf8 atoms, to ensure the whole compilation chain works, and also a test for the r19 option.

OTP-Maintainer · 2016-09-26T18:07:44Z

Patch has passed first testings and has been assigned to be reviewed

I am a script, I am not human

OTP-Maintainer · 2016-10-01T18:07:11Z

Patch has passed first testings and has been assigned to be reviewed

I am a script, I am not human

bjorng · 2016-10-12T14:49:05Z

Three test cases fail in compile_SUITE: core, core_roundtrip, and asm.

I updated the primary bootstrap and compiled from a clean repository.

josevalim · 2016-10-12T14:55:15Z

> I updated the primary bootstrap and compiled from a clean repository.

How do I update the primary bootstrap? I remember doing a clean build but I did not touch the bootstrap files. Thanks!

bjorng · 2016-12-19T12:15:05Z

I found and fixed another minor bug. Please squash and rebase on latest master.

josevalim · 2016-12-19T12:50:31Z

Done and done!

Btw, the work on reducing memory use during compilation that I had to rebase onto may help Elixir too. 👍

bjorng · 2017-01-13T11:16:47Z

Can you please rebase again?

We will test the branch in our daily builds again to see that there are no remaining failed test cases.

josevalim · 2017-01-13T18:09:47Z

@bjorng done!

bjorng · 2017-01-25T14:50:00Z

This branch no longer seems to cause any problems in our daily builds. We are thinking about merging it soon. More work is probably needed in some applications, for example to ensure that "~ts" is used as format specifier when an atom is displayed, but we can do that in separate branches.

However, I think that the documentation should be updated in this branch to reflect the changes, at least for the atom BIFs. Do you think you could update the documentation?

josevalim · 2017-01-25T15:36:05Z

I will push the docs soon. :) Two questions:

1. Regarding ~ts, I understand why it is needed for binaries and char lists, but it shouldn't be necessary for atoms, assuming that atoms do have the encoding they are written in? So I suggest making ~s work regardless of the atom encoding. Or do you see reasons why this would be a bad idea™?
2. Should we explicitly document that module names can now be UTF-8? If so, I would like to add a test to compile_SUITE where we compile a module from Erlang forms with a UTF-8 atom as its name.

josevalim · 2017-01-25T15:59:32Z

I have pushed a separate commit with docs. Once review is done, I can squash it all together.

bjorng · 2017-01-26T07:37:43Z

1. Making ~s work for any atom is an interesting idea. I will discuss it with the team.
2. We have not reached a final decision yet. Our current thinking is that we will allow Unicode characters in module names, but we will not recommend it because of potential portability problems having to do with the file names (Linux does not enforce any particular encoding for file names, and there could be other potential pitfalls when moving files from one operating system to another). The reason that we will probably allow it is that those potential problems already exist for module names with Latin-1 characters.

josevalim · 2017-01-26T12:31:53Z

> Our current thinking is that we will allow Unicode characters in module names, but we will not recommend it because of potential portability problems having to do with the file names

I agree. Let me know what your decisions are regarding 1 and 2 and I will gladly add tests and improve the docs.

bjorng · 2017-01-30T12:20:32Z

It may be some time before we can reach a decision. We will merge this branch and do further corrections and improvements in separate pull requests.

Please squash the commits. When you have done it, we will run the branch once more in our daily builds and then merge it.

The new chunk stores atoms encoded in UTF-8. beam_lib has also been modified to handle the new 'utf8_atoms' attribute while the 'atoms' attribute may be a missing chunk from now on. The binary_to_atom/2 BIF can now encode any utf8 binary with up to 255 characters. The list_to_atom/1 BIF can now accept codepoints higher than 255 with up to 255 characters (thanks to Björn Gustavsson).

josevalim · 2017-01-30T14:25:19Z

Squashed and pushed.

nox · 2017-01-30T14:29:47Z

> do further corrections/improvement in separate pull requests.

Like renaming the chunk to 'JOSÉ', right?

josevalim · 2017-02-01T12:10:03Z

🎉

KronicDeth · 2017-08-07T04:53:21Z

Support for decompiling AtU8 chunks has been added to IntelliJ Elixir in KronicDeth/intellij-elixir#777, which will be released in version 6.0.0.

psyeugenic added team:VM Assigned to OTP team VM team:MW feature labels May 31, 2016

margnus1 reviewed May 31, 2016

josevalim force-pushed the jv-atu8-chunk branch from 0ad2fdc to 8b5121e Compare June 1, 2016 10:23

bjorng self-assigned this Jun 3, 2016

psyeugenic added the kanban label Jun 13, 2016

bjorng added the waiting waiting for changes/input from author label Sep 22, 2016

josevalim force-pushed the jv-atu8-chunk branch from 8b5121e to 7367a1f Compare September 26, 2016 11:23

josevalim commented Sep 26, 2016

josevalim force-pushed the jv-atu8-chunk branch from 7367a1f to e426d18 Compare September 26, 2016 14:41

josevalim force-pushed the jv-atu8-chunk branch from e426d18 to f4424da Compare September 26, 2016 21:02

bjorng removed the waiting waiting for changes/input from author label Oct 5, 2016

bjorng added the waiting waiting for changes/input from author label Oct 12, 2016

bjorng added testing currently being tested, tag is used by OTP internal CI and removed waiting waiting for changes/input from author labels Dec 7, 2016

bjorng removed the testing currently being tested, tag is used by OTP internal CI label Dec 14, 2016

bjorng added the waiting waiting for changes/input from author label Dec 19, 2016

josevalim force-pushed the jv-atu8-chunk branch from 4bb624c to b58cd32 Compare December 19, 2016 12:48

bjorng removed the waiting waiting for changes/input from author label Dec 19, 2016

bjorng added testing currently being tested, tag is used by OTP internal CI and removed testing currently being tested, tag is used by OTP internal CI labels Jan 9, 2017

josevalim force-pushed the jv-atu8-chunk branch from b58cd32 to d14cae9 Compare January 13, 2017 18:09

bjorng added the testing currently being tested, tag is used by OTP internal CI label Jan 14, 2017

josevalim force-pushed the jv-atu8-chunk branch from 7257839 to 26b59df Compare January 30, 2017 14:24

bjorng merged commit 26b59df into erlang:master Feb 1, 2017

josevalim mentioned this pull request Feb 1, 2017

Introduce a new core pass called sys_core_alias #1080

Merged

KronicDeth mentioned this pull request Aug 7, 2017

Decompilation not working with OTP 20 using AtU8 chunk for ASCII atoms KronicDeth/intellij-elixir#772

Closed

bjorng mentioned this pull request Oct 8, 2024

compiler: Support long UTF-8 encoded atoms #8913

Merged
